# README

# Overview

The sections below summarize the implementation of our proposed method, LaVCa, and our in-house implementation of an existing method, BrainSCUBA, which we use for comparison.


---

## Data Preparation

### 1. NSD Dataset
- Download the **Natural Scenes Dataset (NSD)** from the following website:
  - [https://naturalscenesdataset.org/](https://naturalscenesdataset.org/)

### 2. OpenImages
- Download **OpenImages** from:
  - [https://storage.googleapis.com/openimages/web/download_v6.html](https://storage.googleapis.com/openimages/web/download_v6.html)

### 3. Package Installation
- Install all required Python packages:
  ```bash
  pip install -r requirements.txt
  ```

---

## Running the Pipelines

*Note: Each Python script contains detailed instructions on how to run it. Below is an outline of the main steps.*

### LaVCa (in the `LaVCa` directory)

1. **Step 0**: Feature Extraction
   - **Script**: `step0_feature_extracting.py`
   - **Description**: Extract CLIP-Vision features for building voxel-wise encoding models (used in Step 1).

2. **Step 1**: Encoding Model Construction
   - **Script**: `step1_encoding.py`
   - **Description**: Construct voxel-wise encoding models for each subject viewing natural images.

3. **Step 2**: Search for Optimal Images
   - **Script**: `step2_search_optimal_images.py`
   - **Description**: Identify the top-`N` images that most strongly activate each voxel based on the trained encoding models.

4. **Step 3**: Captioning Optimal Images
   - **Script**: `step3_captioning_opt_images.py`
   - **Description**: Generate captions for the selected “optimal images” using a Multimodal Large Language Model (MLLM). These captions will be summarized in the next step.

5. **Step 4**: Voxel Caption Generation
   - **Script**: `step4_caption_generation.py`
   - **Description**: Extract keywords from the image captions, filter them, and use a “Sentence Composer” to generate concise voxel-level captions.
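
Steps 1 and 2 above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `step1_encoding.py` or `step2_search_optimal_images.py`: it assumes a plain closed-form ridge regression (the repository may instead rely on Himalaya, listed in the references), and it uses random arrays as stand-ins for CLIP-Vision features and voxel responses. The helper names `fit_ridge` and `top_n_images` are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y.

    X: (n_samples, n_features) stimulus features.
    Y: (n_samples, n_voxels) measured responses.
    Returns W with shape (n_features, n_voxels).
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def top_n_images(W, candidate_features, n=5):
    """For each voxel, return indices of the n candidates with the highest
    predicted response, ordered from strongest to weakest."""
    pred = candidate_features @ W            # (n_candidates, n_voxels)
    order = np.argsort(-pred, axis=0)[:n]    # (n, n_voxels)
    return order.T                           # (n_voxels, n)


# Synthetic stand-ins (assumption): 200 training images, 16-dim features, 3 voxels.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 16))
W_true = rng.standard_normal((16, 3))
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((200, 3))

W = fit_ridge(X_train, Y_train, alpha=1.0)

# Candidate pool standing in for features of an external image set.
pool = rng.standard_normal((50, 16))
top_idx = top_n_images(W, pool, n=5)
print(top_idx.shape)  # one row of image indices per voxel
```

In practice the regularization strength would be chosen per voxel by cross-validation, and the candidate pool would hold features of real images rather than random vectors.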

---

### BrainSCUBA (in the `BrainSCUBA` directory)

1. **Step 0**: Feature Extraction
   - **Script**: `step0_feature_extracting.py`
   - **Description**: Extract CLIP-Vision features for building voxel-wise encoding models (used in Step 1).

2. **Step 1**: Encoding Model Construction
   - **Script**: `step1_encoding.py`
   - **Description**: Construct voxel-wise encoding models for each subject viewing natural images.

3. **Step 2**: Caption Generation
   - **Script**: `step2_caption_generation.py`
   - **Description**: Convert voxel-wise encoding weights into image features, then feed these features into an image captioning model to obtain voxel captions.
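
The weight-to-feature conversion in Step 2 can be illustrated as follows. This is a sketch under stated assumptions, not the contents of `step2_caption_generation.py`: it assumes the conversion is a softmax-weighted average over a bank of image embeddings, weighted by cosine similarity to the voxel's weight vector. The function name `project_weights`, the `temperature` value, and the random embedding bank are all hypothetical, and the captioning model that consumes the result is omitted.

```python
import numpy as np


def project_weights(w, image_embeddings, temperature=0.01):
    """Map a voxel weight vector onto the span of real image embeddings.

    w: (dim,) encoding weights for one voxel.
    image_embeddings: (n_images, dim) bank of image embeddings.
    Returns a (dim,) convex combination of unit-normalized embeddings.
    """
    w_unit = w / np.linalg.norm(w)
    emb_unit = image_embeddings / np.linalg.norm(
        image_embeddings, axis=1, keepdims=True
    )
    sims = emb_unit @ w_unit                 # cosine similarity per image
    logits = sims / temperature
    logits = logits - logits.max()           # stabilize the softmax
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return probs @ emb_unit


# Synthetic stand-ins (assumption): 100 image embeddings of dimension 8.
rng = np.random.default_rng(0)
bank = rng.standard_normal((100, 8))
voxel_w = rng.standard_normal(8)

projected = project_weights(voxel_w, bank)
print(projected.shape)
```

A lower temperature pulls the result toward the single most similar embedding, while a higher one averages over more of the bank.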

---

## Evaluation (in the `evaluation` directory)

There are two main types of evaluations:

1. **Sentence-Level Evaluation**
   - **Script**: `eval_sentence_similarity.py`
   - **Description**: Evaluate sentence-level similarity between the generated voxel captions and the ground-truth captions from NSD.

2. **Image-Level Evaluation**
   1. **Image Generation**  
      - **Script**: `caption2image.py`  
      - **Description**: Convert the voxel captions into images using an image generation model.

   2. **Image Similarity**  
      - **Script**: `eval_image_similarity.py`  
      - **Description**: Evaluate the similarity between the generated images (from voxel captions) and the original NSD images.
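
Both evaluations compare pairs of items, which is often done by embedding each item and scoring the pair. The sketch below assumes cosine similarity over precomputed embeddings. The README does not state which metric or embedding model the evaluation scripts use, so the metric, the helper name `cosine_similarity`, and the toy vectors are assumptions for illustration only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, dim) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


# Toy embeddings (assumption): row i of `pred` is compared with row i of `ref`.
pred = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
ref = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, -1.0]])

scores = cosine_similarity(pred, ref)
print(scores)  # same direction, orthogonal, opposite direction
```

The same function applies to sentence embeddings of captions and to image embeddings of generated and original images, as long as both sides come from the same embedding model.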

---

## References
- **Himalaya**: [https://github.com/gallantlab/himalaya](https://github.com/gallantlab/himalaya)
- **CLIP**: [https://github.com/openai/CLIP](https://github.com/openai/CLIP)
- **MeaCap**: [https://github.com/joeyz0z/MeaCap](https://github.com/joeyz0z/MeaCap)

---

